A Simplified Min-Sum Decoding Algorithm
for Non-Binary LDPC Codes 

Chung-Li (Jason) Wang, Xiaoheng Chen, Zongwang Li, and Shaohua Yang 

Abstract 

Non-binary low-density parity-check codes are robust to various channel impairments. However,
decoder implementations based on the existing decoding algorithms are expensive because of their
excessive computational complexity and memory usage. Based on combinatorial optimization, we
present an approximation method for the check node processing. The simulation results demonstrate that
our scheme has small performance loss over the additive white Gaussian noise channel and the independent
Rayleigh fading channel. Furthermore, the proposed reduced-complexity realization provides significant
savings on hardware, so it yields a good performance-complexity tradeoff and can be efficiently
implemented.

Index Terms 

Low-density parity-check (LDPC) codes, non-binary codes, iterative decoding, extended min-sum 
algorithm. 

I. Introduction 

Binary low-density parity-check (LDPC) codes, discovered by Gallager in 1962 [1], were
rediscovered and shown to approach Shannon capacity in the late 1990s [2]. Since their rediscovery,
a great deal of research has been conducted in the study of code construction methods,
decoding techniques, and performance analysis. With hardware-efficient decoding algorithms
such as the min-sum algorithm [3], practical decoders can be implemented for effective error
control. Therefore, binary LDPC codes have been considered for a wide range of applications
such as satellite broadcasting, wireless communications, optical communications, and
high-density storage systems.

As the extension of the binary LDPC codes over the Galois field of order $q$, non-binary
LDPC (NB-LDPC) codes, also known as $q$-ary LDPC codes, were first investigated by Davey
and MacKay in 1998 [4]. They extended the sum-product algorithm (SPA) for binary LDPC codes
to decode $q$-ary LDPC codes and referred to this extension as the $q$-ary SPA (QSPA). Based
on the fast Fourier transform (FFT), they devised an equivalent realization called FFT-QSPA
to reduce the computational complexity of QSPA for codes with $q$ as a power of 2 [4]. With
good construction methods [5]-[9], NB-LDPC codes decoded with the FFT-QSPA outperform
Reed-Solomon codes decoded with the algebraic soft-decision Koetter-Vardy algorithm [10].

As a class of capacity-approaching codes, NB-LDPC codes are capable of correcting symbol-wise
errors and have recently been actively studied by numerous researchers. However, despite the
excellent error performance of NB-LDPC codes, very little research has been contributed
to VLSI decoder implementations, due to the lack of hardware-efficient decoding algorithms.
Even though the FFT-QSPA significantly reduces the number of computations for the QSPA,
its complexity is still too high for practical applications, since it incorporates a great number
of multiplications in the probability domain for both check node (CN) and variable node (VN)
processing. Thus logarithmic-domain approaches were developed to approximate the QSPA,
such as the extended min-sum algorithm (EMSA), which applies message truncation and sorting
to further reduce complexity and memory requirements [11], [12]. The second widely used
algorithm is the min-max algorithm (MMA) [13], which replaces the sum operations in the CN
processing by max operations. With an optimal scaling or offset factor, the EMSA and MMA
can cause less than 0.2 dB performance loss in terms of signal-to-noise ratio (SNR) compared
to the QSPA. However, implementing the EMSA and MMA still requires excessive silicon area,
making the decoder considerably expensive for practical designs [14]-[17]. Besides the QSPA
and its approximations, two reliability-based algorithms were proposed towards much lower
complexity based on the concept of simple orthogonal check-sums used in the one-step
majority-logic decoding [18]. Nevertheless, both algorithms incur at least 0.8 dB of SNR loss compared
to the FFT-QSPA. Moreover, they are effective for decoding only when the parity-check matrix
has a relatively large column weight. Consequently, the existing decoding algorithms are either
too costly to implement or only applicable to limited code classes at the cost of large performance
degradation.

Therefore, we propose a reduced-complexity decoding algorithm, called the simplified min- 
sum algorithm (SMSA), which is derived from our analysis of the EMSA based on the combina- 
torial optimization. Compared to the QSPA, the SMSA shows small SNR loss, which is similar 
to that of the EMSA and MMA. Regarding the complexity of the CN processing, the SMSA 
saves around 60% to 70% of computations compared to the EMSA. Also, the SMSA provides an 
exceptional saving of memory usage in the decoder design. According to our simulation results 
and complexity estimation, this decoding algorithm achieves a favorable tradeoff between error 
performance and implementation cost. 

The rest of the paper is organized as follows. The NB-LDPC code and EMSA decoding are
reviewed in Section II. The SMSA is derived and developed in Section III. The error performance
simulation results are summarized in Section IV. In Section V, the SMSA is compared with the
EMSA in terms of complexity and memory usage. Finally, Section VI concludes this paper.



II. NB-LDPC Codes and Iterative Decoding 

Let GF($q$) denote a finite field of $q$ elements with addition $\oplus$ and multiplication $\otimes$. We will
focus on the field with characteristic 2, i.e., $q = 2^p$. In such a field, each element has a binary
representation, which is a vector of $p$ bits and can be translated to a decimal number. Thus we
label the elements in GF($2^p$) as $\{0, 1, 2, \ldots, 2^p - 1\}$. An $(n, r)$ $q$-ary LDPC code $C$ is given by
the null space of an $m \times n$ sparse parity-check matrix $\mathbf{H} = [h_{i,j}]$ over GF($q$), with the dimension
$r$.
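As a brief illustration of this labeling, the following sketch implements addition and multiplication for GF($2^3$) on the integer labels. The primitive polynomial $x^3 + x + 1$ and the helper names `gf_add` and `gf_mul` are assumptions made for this example only; the text does not fix a particular polynomial.

```python
def gf_add(a, b):
    """Addition in GF(2^p): bitwise XOR of the p-bit labels."""
    return a ^ b


def gf_mul(a, b, p=3, poly=0b1011):
    """Multiplication in GF(2^p) by shift-and-add, reduced modulo `poly`.

    `poly` is the assumed primitive polynomial x^3 + x + 1 for p = 3.
    """
    result = 0
    while b:
        if b & 1:            # add the current multiple of a
            result ^= a
        b >>= 1
        a <<= 1              # multiply a by x
        if a & (1 << p):     # degree reached p: reduce
            a ^= poly
    return result


# Every nonzero element multiplied by a fixed nonzero element permutes the nonzero labels.
products_of_3 = sorted(gf_mul(3, x) for x in range(1, 8))
```

Under these assumptions, `gf_add(5, 3)` gives 6 and `gf_mul(2, 4)` gives 3, since $x \cdot x^2 = x^3 = x + 1$.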

The parity-check matrix $\mathbf{H}$ can be represented graphically by a Tanner graph, which is a
bipartite graph with two disjoint variable node (VN) and check node (CN) classes. The $j$-th VN
represents the $j$-th column of $\mathbf{H}$, which is associated with the $j$-th symbol of the $q$-ary codeword.
The $i$-th CN represents its $i$-th row, i.e., the $i$-th $q$-ary parity check of $\mathbf{H}$. The $j$-th VN and $i$-th
CN are connected by an edge if $h_{i,j} \neq 0$. This implies that the $j$-th code symbol is checked by the
$i$-th parity check. Thus for $0 \le i < m$ and $0 \le j < n$, we define $N_i = \{j : 0 \le j < n, h_{i,j} \neq 0\}$
and $M_j = \{i : 0 \le i < m, h_{i,j} \neq 0\}$. The size of $N_i$ is referred to as the CN degree of the $i$-th
CN, denoted as $|N_i|$. The size of $M_j$ is referred to as the VN degree of the $j$-th VN, denoted as
$|M_j|$. If both VN and CN degrees are invariable, letting $d_v = |M_j|$ and $d_c = |N_i|$, such a code
is called a $(d_v, d_c)$-regular code. Otherwise it is an irregular code.
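The neighbor sets can be read directly off the parity-check matrix. The sketch below does this for a toy matrix; the matrix `H` and the helper name `neighbor_sets` are hypothetical and serve only to illustrate the definitions of $N_i$ and $M_j$.

```python
def neighbor_sets(H):
    """Return (N, M): N[i] lists the VNs checked by CN i, M[j] lists the CNs checking VN j."""
    m, n = len(H), len(H[0])
    N = [[j for j in range(n) if H[i][j] != 0] for i in range(m)]
    M = [[i for i in range(m) if H[i][j] != 0] for j in range(n)]
    return N, M


# Hypothetical 2 x 4 parity-check matrix with GF(4) labels (illustration only).
H = [[1, 2, 0, 3],
     [0, 1, 3, 2]]
N, M = neighbor_sets(H)
cn_degrees = [len(Ni) for Ni in N]   # |N_i| for each CN
vn_degrees = [len(Mj) for Mj in M]   # |M_j| for each VN
```

In this toy example the CN degrees are constant but the VN degrees are not, so the matrix would describe an irregular code.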

Similarly to binary LDPC codes, $q$-ary LDPC codes can be decoded iteratively by the message
passing algorithm, in which messages are passed through the edges between the CNs and VNs.
In the QSPA, EMSA, and MMA, a message is a vector composed of $q$ sub-messages, or simply,
entries. Let $\lambda_j = [\lambda_j(0), \lambda_j(1), \ldots, \lambda_j(q-1)]$ be the a priori information of the $j$-th code
symbol from the channel. Assuming that $X_j$ is the $j$-th code symbol, the $d$-th sub-message of $\lambda_j$
is a log-likelihood reliability (LLR) defined as $\lambda_j(d) = \log(\mathrm{Prob}(X_j = z_j)/\mathrm{Prob}(X_j = d))$. $z_j$ is
the most likely (ML) symbol for $X_j$, i.e., $z_j = \arg\max_{d \in \mathrm{GF}(q)} \mathrm{Prob}(X_j = d)$, and $\mathbf{z} = [z_j]_{j=1 \ldots n}$.
The smaller $\lambda_j(d)$ is, the more likely $X_j = d$ is. Let $\alpha_{i,j}$ and $\beta_{i,j}$ be the VN-to-CN (V2C)
and CN-to-VN (C2V) soft messages between the $i$-th CN and $j$-th VN respectively. For all
$d \in \mathrm{GF}(q)$, the $d$-th entry of $\alpha_{i,j}$, denoted as $\alpha_{i,j}(d)$, is the logarithmic reliability of $d$ from
the VN perspective. $a_{i,j}$ is the symbol with the smallest reliability, i.e., the ML symbol of the
V2C message. With $x_{i,j} = X_j \otimes h_{i,j}$, we let $\alpha_{i,j}(d) = \log(\mathrm{Prob}(x_{i,j} = a_{i,j})/\mathrm{Prob}(x_{i,j} = d))$ and
$\alpha_{i,j}(a_{i,j}) = 0$. $b_{i,j}$ and $\beta_{i,j}(d)$ are defined from the CN perspective similarly. The EMSA can be
summarized as follows.

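Algorithm 1. The Extended Min-Sum Algorithm

Initialization: Set $z_j = \arg\min_{d \in \mathrm{GF}(q)} \lambda_j(d)$. For all $i, j$ with $h_{i,j} \neq 0$, set $\alpha_{i,j}(h_{i,j} \otimes d) = \lambda_j(d)$.
Set $\kappa = 0$.

• Step 1) Parity check: Compute the syndrome $\mathbf{z} \otimes \mathbf{H}^T$. If $\mathbf{z} \otimes \mathbf{H}^T = \mathbf{0}$, stop decoding and
output $\mathbf{z}$ as the decoded codeword; otherwise go to Step 2.

• Step 2) If $\kappa = \kappa_{\max}$, stop decoding and declare a decoding failure; otherwise, go to Step 3.

• Step 3) CN processing: Let the configurations $C_i(x_{i,j} = d)$ be the sequences $[x_{i,j'}]_{j' \in N_i}$ such
that $\bigoplus_{j' \in N_i} x_{i,j'} = 0$ and $x_{i,j} = d$. With a preset scaling factor $0 < c \le 1$, compute the
C2V messages by

\[ \beta_{i,j}(d) = c \cdot \min_{C_i(x_{i,j} = d)} \sum_{j' \in N_i \setminus j} \alpha_{i,j'}(x_{i,j'}). \tag{1} \]

• Step 4) VN processing: $\kappa \leftarrow \kappa + 1$. Compute V2C messages in two steps. First compute
the primitive messages by

\[ \hat{\alpha}_{i,j}(h_{i,j} \otimes d) = \lambda_j(d) + \sum_{i' \in M_j \setminus i} \beta_{i',j}(h_{i',j} \otimes d). \tag{2} \]

• Step 5) Message normalization: Obtain V2C messages by normalizing with respect to the
ML symbol

\[ a_{i,j} = \arg\min_{d \in \mathrm{GF}(q)} \hat{\alpha}_{i,j}(d), \tag{3} \]

\[ \alpha_{i,j}(d) = \hat{\alpha}_{i,j}(d) - \hat{\alpha}_{i,j}(a_{i,j}). \tag{4} \]

• Step 6) Tentative decisions:

\[ \hat{\lambda}_j(d) = \lambda_j(d) + \sum_{i \in M_j} \beta_{i,j}(h_{i,j} \otimes d), \tag{5} \]

\[ z_j = \arg\min_{d \in \mathrm{GF}(q)} \hat{\lambda}_j(d). \tag{6} \]

• Go to Step 1.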
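To make the cost of the CN processing in Step 3 concrete, the following sketch evaluates the check node update by brute force: for each target symbol it minimizes the sum of V2C reliabilities over all configurations whose symbols sum to zero in GF($2^p$). It is a minimal illustration under stated assumptions, not a practical decoder: the function name `emsa_cn_update`, the toy messages, and the choice $c = 1$ are hypothetical, and the incoming messages are assumed to be already permuted by the edge coefficients.

```python
from itertools import product


def emsa_cn_update(alphas, c=1.0):
    """Brute-force C2V update for one edge of a check node.

    alphas: V2C vectors (length q each) of the other d_c - 1 edges.
    Returns the C2V vector for the target edge, scaled by c.
    """
    q = len(alphas[0])
    beta = [float("inf")] * q
    # Enumerate all q^(d_c - 1) assignments of the other symbols.
    for config in product(range(q), repeat=len(alphas)):
        d = 0
        for x in config:
            d ^= x  # the target symbol must cancel the GF(2^p) sum of the others
        cost = sum(a[x] for a, x in zip(alphas, config))
        beta[d] = min(beta[d], cost)
    return [c * b for b in beta]


# Toy example over GF(4) with two incoming messages (illustration only).
alphas = [[0, 1, 4, 2], [3, 0, 2, 5]]
beta = emsa_cn_update(alphas)
```

The loop visits $q^{d_c - 1}$ configurations per edge, which is why this direct evaluation is only feasible for very small fields and check node degrees.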



III. A Simplified Min-Sum Decoding Algorithm 



In this section we develop the simplified min-sum decoding algorithm. In the first part, we 
analyze the configurations and propose the approximation of the CN processing. Then in the 
second part, a practical scheme is presented to achieve the tradeoff between complexity and 
performance. 



A. Algorithm Derivation and Description 

In the beginning, two differences between the SMSA and EMSA are introduced. First, the
SMSA utilizes $a_{i,j}$ ($b_{i,j}$) as the V2C (C2V) hard message, which indicates the ML symbol given
by the V2C (C2V) message. Second, the reordering of soft message entries in the SMSA is
defined as:

\[ \tilde{\alpha}_{i,j}(\delta) = \alpha_{i,j}(\delta \oplus a_{i,j}), \tag{7} \]

\[ \tilde{\beta}_{i,j}(\delta) = \beta_{i,j}(\delta \oplus b_{i,j}), \tag{8} \]

for all $i, j$ with $h_{i,j} \neq 0$. While in the EMSA the arrangement of entries is made by the absolute
value, the SMSA arranges the entries by the relative value to the hard message, expressed and
denoted as the deviation $\delta$. Thus before the CN processing of the SMSA, the messages are
required to be transformed from the absolute space to the deviation space.
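The transformation between the two spaces is a re-indexing of the message entries by XOR with the hard message. The following sketch shows both directions on a toy vector; the helper names `to_deviation_space` and `to_absolute_space` are hypothetical.

```python
def to_deviation_space(alpha):
    """Return (hard message, deviation-space message) for a normalized soft message."""
    q = len(alpha)
    hard = min(range(q), key=alpha.__getitem__)        # ML symbol: entry with smallest value
    alpha_tilde = [alpha[delta ^ hard] for delta in range(q)]
    return hard, alpha_tilde


def to_absolute_space(hard, msg_tilde):
    """Inverse re-indexing: entry d of the absolute message is entry d XOR hard of the deviation message."""
    return [msg_tilde[d ^ hard] for d in range(len(msg_tilde))]


alpha = [3, 0, 2, 5]                      # toy V2C message over GF(4)
a, alpha_tilde = to_deviation_space(alpha)
round_trip = to_absolute_space(a, alpha_tilde)
```

In the deviation space the entry at $\delta = 0$ always belongs to the hard message, so for a normalized message it is zero.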

Equation (1) performs the combinatorial optimization over all configurations. If we regard
the sum of reliabilities $\sum_{j' \in N_i \setminus j} \alpha_{i,j'}(x_{i,j'})$ as the reliability of the configuration $[x_{i,j'}]_{j' \in N_i}$, this
operation actually provides the most likely configuration and assigns its reliability to the result.
However, the size of its search space is of $O(q^{d_c})$ and leads to excessive complexity. Fortunately,
in [11] it is observed that the optimization tends to choose the configuration with more entries
equal to the V2C hard messages. Therefore, if we define the order as the number of all $j' \in N_i \setminus j$
such that $x_{i,j'} \neq a_{i,j'}$, (1) can be reduced by utilizing the order-$k$ subset, denoted as $C_i^{(k)}(x_{i,j} = d)$,
which consists of the configurations of orders not higher than $k$. Limiting the size of the search
space gives a reduced-search algorithm with performance loss [11], so adjusting $k$ can be used to
give a tradeoff between performance and complexity. We denote the order-$k$ C2V soft message
by $\beta^{(k)}$ (with the subscript $i, j$ omitted for clearness), i.e.,

\[ \beta_{i,j}(d) \le \beta^{(k)}(d) = \min_{C_i^{(k)}(x_{i,j} = d)} \sum_{j' \in N_i \setminus j} \alpha_{i,j'}(x_{i,j'}), \tag{9} \]

since $C_i^{(k)}(x_{i,j} = d) \subseteq C_i(x_{i,j} = d)$. In the following context, we will show the computations
for the hard message and order-1 soft message. Then these messages will be used to generate
high-order messages. The hard message is simply given by Theorem 1.



Theorem 1. The hard message $b_{i,j}$ is determined by

\[ b_{i,j} = \arg\min_{d \in \mathrm{GF}(q)} \beta_{i,j}(d) = \bigoplus_{j' \in N_i \setminus j} a_{i,j'}. \tag{10} \]

Besides, for any order $k$, $\beta_{i,j}(b_{i,j}) = \tilde{\beta}_{i,j}(0) = \beta^{(k)}(b_{i,j}) = 0$.

PROOF From (9) the inequality is obtained as:

\[ \beta^{(k)}(d) \ge \sum_{j' \in N_i \setminus j} \min_{x_{i,j'} \in \mathrm{GF}(q)} \alpha_{i,j'}(x_{i,j'}) = \sum_{j' \in N_i \setminus j} \alpha_{i,j'}(a_{i,j'}). \tag{11} \]

If $x_{i,j} = b_{i,j}$ and $x_{i,j'} = a_{i,j'}$ for all $j' \in N_i \setminus j$, we get an order-0 configuration, included
in $C_i^{(k)}(x_{i,j} = b_{i,j})$ for any $k$. Thus one can find that equality holds in (11) if $d = b_{i,j}$, and
$\beta^{(k)}(b_{i,j})$ has the smallest reliability. It follows that for any $k$

\[ \beta_{i,j}(b_{i,j}) = \beta^{(k)}(b_{i,j}) = \sum_{j' \in N_i \setminus j} \alpha_{i,j'}(a_{i,j'}) = 0. \tag{12} \]

□

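Based on Theorem 1, for any $k$ we can define the order-$k$ message $\tilde{\beta}^{(k)}(\delta) = \beta^{(k)}(\delta \oplus b_{i,j})$ in
the deviation space. For $\delta \neq 0$, the order-1 C2V message $\tilde{\beta}^{(1)}(\delta)$ can be determined by Theorem
2, which performs a combinatorial optimization in the deviation space.

Theorem 2. With $\delta = b_{i,j} \oplus d$, the order-1 soft message is determined by

\[ \beta^{(1)}(d) = \tilde{\beta}^{(1)}(\delta) = \min_{j'' \in N_i \setminus j} \Big( \sum_{j' \in N_i \setminus \{j, j''\}} \alpha_{i,j'}(a_{i,j'}) + \alpha_{i,j''}(a_{i,j''} \oplus \delta) \Big) = \min_{j'' \in N_i \setminus j} \tilde{\alpha}_{i,j''}(\delta). \tag{13} \]

PROOF According to the definition of the order, each configuration in $C_i^{(1)}(x_{i,j} = d)$ has
$x_{i,j''} = a_{i,j''} \oplus \delta$ for some $j'' \in N_i \setminus j$ and $x_{i,j'} = a_{i,j'}$ for all $j' \in N_i \setminus \{j, j''\}$, since
$d \oplus (a_{i,j''} \oplus \delta) \oplus \bigoplus_{j' \in N_i \setminus \{j, j''\}} a_{i,j'} = 0$. It follows that selecting $j'' \in N_i \setminus j$ is equivalent to selecting an order-1
configuration in $C_i^{(1)}(x_{i,j} = d)$. Correspondingly, minimizing $\tilde{\alpha}_{i,j''}(\delta)$ over $j''$ in the deviation
space is equivalent to minimizing $\alpha_{i,j''}(a_{i,j''} \oplus \delta)$ over the configurations in the absolute space.
Hence searching for $j''$ to minimize the sum in the bracket of (13) yields $\beta^{(1)}(d)$. □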
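Theorems 1 and 2 can be illustrated together on a toy check node: the hard message is the XOR of the other edges' hard messages, and the order-1 soft message is an entry-wise minimum over the deviation-space V2C messages. The sketch below is illustrative only; the function name `order1_cn_update` and the toy messages are hypothetical.

```python
from functools import reduce


def order1_cn_update(alphas):
    """Hard message and order-1 soft message for one edge of a check node.

    alphas: normalized V2C vectors (absolute space) of the other d_c - 1 edges.
    Returns (b, beta_tilde, beta): hard message, order-1 message in the
    deviation space, and the same message re-indexed to the absolute space.
    """
    q = len(alphas[0])
    hard = [min(range(q), key=a.__getitem__) for a in alphas]
    b = reduce(lambda x, y: x ^ y, hard, 0)                      # XOR of the V2C hard messages
    tildes = [[a[delta ^ h] for delta in range(q)] for a, h in zip(alphas, hard)]
    beta_tilde = [min(t[delta] for t in tildes) for delta in range(q)]   # entry-wise minimum
    beta = [beta_tilde[d ^ b] for d in range(q)]
    return b, beta_tilde, beta


alphas = [[0, 1, 4, 2], [3, 0, 2, 5]]    # toy messages over GF(4)
b, beta_tilde, beta = order1_cn_update(alphas)
```

For these toy messages an exhaustive search over all configurations gives $[1, 0, 2, 3]$, so the order-1 result $[1, 0, 2, 4]$ is exact except at $d = 3$, where the best configuration deviates on both edges and is therefore of order 2.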



Similarly to Theorem 2, in the absolute space an order-$k$ configuration can be determined by
assigning a deviation to each of $k$ VNs selected from $N_i \setminus j$, i.e., $x_{i,j'} = \delta_{j'} \oplus a_{i,j'}$ with $\delta_{j'} \neq 0$
for selected VNs and $\delta_{j'} = 0$ for all other VNs. Thus in the deviation space, the order-$k$ message
can be computed as follows:

Theorem 3. With $\delta = b_{i,j} \oplus d$, choosing a combination of $k$ symbols from GF($q$) (denoted
as $\delta_1 \ldots \delta_k$) and picking a permutation of $k$ different VNs from the set $N_i \setminus j$ (denoted as
$j_1, j_2, \ldots, j_k$), the order-$k$ soft message is given by

\[ \beta^{(k)}(d) = \tilde{\beta}^{(k)}(\delta) = \min_{\bigoplus_{\ell=1}^{k} \delta_\ell = \delta} \; \min_{j_1 \neq \cdots \neq j_k} \sum_{\ell=1}^{k} \tilde{\alpha}_{i,j_\ell}(\delta_\ell). \tag{14} \]

Theorem 3 shows that the configuration set can be analyzed as the Cartesian product of the
set of symbol combinations and that of VN permutations. For Equation (14) the required set of
combinations can be generated according to Theorem 4.



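Theorem 4. The set of $k$-symbol combinations $\delta_1 \ldots \delta_k$ for (14) can be obtained by choosing $k$
symbols from GF($q$) of which there exists no subset with the sum equal to 0.

PROOF Suppose that there exists a subset $R$ in $\{1, \ldots, k\}$ such that $\bigoplus_{\ell \in R} \delta_\ell = 0$. With a
modified $k$-symbol combination that $\delta'_\ell = 0$ for all $\ell \in R$ and $\delta'_\ell = \delta_\ell$ for all $\ell \in \{1, \ldots, k\} \setminus R$,
we have

\[ \sum_{\ell=1}^{k} \tilde{\alpha}_{i,j_\ell}(\delta'_\ell) = \sum_{\ell \in \{1, \ldots, k\} \setminus R} \tilde{\alpha}_{i,j_\ell}(\delta_\ell) \le \sum_{\ell=1}^{k} \tilde{\alpha}_{i,j_\ell}(\delta_\ell), \tag{15} \]

where $\bigoplus_{\ell=1}^{k} \delta'_\ell = \bigoplus_{\ell=1}^{k} \delta_\ell = \delta$. Thus the original combination can be ignored. □

Directly following from Theorem 4, Lemma 5 shows that $\tilde{\beta}^{(k)}(\delta)$ of order $k > p$ is equal to
$\tilde{\beta}^{(p)}(\delta)$, since the combinations with more than $p$ nonzero symbols can be ignored.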

Lemma 5. With $q = 2^p$, for all $\delta \in \mathrm{GF}(q)$, we have

\[ \tilde{\beta}^{(p)}(\delta) = \tilde{\beta}^{(p+1)}(\delta) = \cdots = \tilde{\beta}(\delta). \tag{16} \]

PROOF $\tilde{\beta}^{(k)}(\delta)$ is determined in (14) by searching for the optimal $k$-symbol combination
with $\bigoplus_{\ell=1}^{k} \delta_\ell = \delta$. Assuming that some $\delta_\ell$ is 0, this combination is equivalent to the $(k-1)$-symbol
combination and has been considered for $\tilde{\beta}^{(k-1)}(\delta)$. Otherwise if all symbols are nonzero, with
$k \ge p + 1$, we can consider the $p \times k$ binary matrix $\mathbf{B}$ of which the $\ell$-th column is the binary
vector of $\delta_\ell$. Since the rank is at most $p$, it can be proved that there must exist a subset $R$ in
$\{1, \ldots, k\}$ such that $\bigoplus_{\ell \in R} \delta_\ell = 0$. Following from Theorem 4, the $k$-symbol combination can be
ignored, but the equivalent $(k - |R|)$-symbol combination has been considered for $\tilde{\beta}^{(k-|R|)}(\delta)$.
Consequently, after ignoring every combination of more than $p$ nonzero symbols, the search
space for $\tilde{\beta}^{(k)}(\delta)$ becomes equivalent to that for $\tilde{\beta}^{(p)}(\delta)$. It implies that $\tilde{\beta}^{(k)}(\delta)$ must be equal to
$\tilde{\beta}^{(p)}(\delta)$. □
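The rank argument in the proof can be checked exhaustively for a small field. The sketch below verifies, for $p = 3$, that every choice of $p + 1$ nonzero symbols contains a nonempty subset whose XOR is zero, while some choices of $p$ symbols do not. It is only a numerical sanity check under the assumption that symbols are represented by their integer labels; the helper name `has_zero_sum_subset` is hypothetical.

```python
from itertools import combinations, combinations_with_replacement


def has_zero_sum_subset(symbols):
    """True if some nonempty subset of `symbols` XORs to zero."""
    for r in range(1, len(symbols) + 1):
        for subset in combinations(symbols, r):
            acc = 0
            for s in subset:
                acc ^= s
            if acc == 0:
                return True
    return False


p = 3
q = 1 << p
# Any p + 1 nonzero p-bit vectors are linearly dependent over GF(2).
all_dependent = all(
    has_zero_sum_subset(combo)
    for combo in combinations_with_replacement(range(1, q), p + 1)
)
# With only p symbols, a basis such as (1, 2, 4) has no zero-sum subset.
basis_is_independent = not has_zero_sum_subset((1, 2, 4))
```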

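By the derivations given above, we have proposed to reduce the search space significantly
in the deviation space, especially for the larger check node degree and smaller field. Lemma 5
also yields the maximal configuration order required by (1), i.e., $\min(d_c - 1, p)$. Moreover, in
(14), the $k$ VNs are chosen from $N_i \setminus j$ without repetition. However, if $k$ VNs are allowed to
be chosen with repetition, the search space will expand such that (14) can be approximated by
the lower bound:

\[
\begin{aligned}
\tilde{\beta}^{(k)}(\delta) &\ge \min_{\bigoplus_{\ell=1}^{k} \delta_\ell = \delta} \; \min_{j_1 \in N_i \setminus j} \cdots \min_{j_k \in N_i \setminus j} \sum_{\ell=1}^{k} \tilde{\alpha}_{i,j_\ell}(\delta_\ell) \\
&= \min_{\bigoplus_{\ell=1}^{k} \delta_\ell = \delta} \; \sum_{\ell=1}^{k} \min_{j_\ell \in N_i \setminus j} \tilde{\alpha}_{i,j_\ell}(\delta_\ell) \\
&= \min_{\bigoplus_{\ell=1}^{k} \delta_\ell = \delta} \; \sum_{\ell=1}^{k} \tilde{\beta}^{(1)}(\delta_\ell),
\end{aligned}
\]

where the last equation follows from (13). Therefore, the SMSA can be carried out as follows: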



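Algorithm 2. The Simplified Min-Sum Algorithm

Initialization: Set $z_j = \arg\min_{d \in \mathrm{GF}(q)} \lambda_j(d)$. For all $i, j$ with $h_{i,j} \neq 0$, set $a_{i,j} = h_{i,j} \otimes z_j$ and
$\tilde{\alpha}_{i,j}(h_{i,j} \otimes \delta) = \lambda_j(\delta \oplus z_j)$. Set $\kappa = 0$.

• Steps 1) and 2) (The same as Steps 1 and 2 in the EMSA)

CN processing: Steps 3.1-3.4

• Step 3.1) Compute the C2V hard messages:

\[ b_{i,j} = \bigoplus_{j' \in N_i \setminus j} a_{i,j'}. \]

• Step 3.2) Compute the step-1 soft messages:

\[ \tilde{\beta}^{(1)}_{i,j}(\delta) = \min_{j' \in N_i \setminus j} \tilde{\alpha}_{i,j'}(\delta). \tag{19} \]

• Step 3.3) Compute the step-2 soft messages by selecting the combination of $k$ symbols
according to Theorem 4:

\[ \tilde{\beta}^{(k)}_{i,j}(\delta) = \min_{\bigoplus_{\ell=1}^{k} \delta_\ell = \delta} \; \sum_{\ell=1}^{k} \tilde{\beta}^{(1)}_{i,j}(\delta_\ell). \tag{20} \]

• Step 3.4) Scaling and reordering: With $0 < c \le 1$, $\tilde{\beta}_{i,j}(\delta) \approx c \cdot \tilde{\beta}^{(k)}_{i,j}(\delta)$.
For $d \neq b_{i,j}$, $\beta_{i,j}(d) = \tilde{\beta}_{i,j}(b_{i,j} \oplus d)$; otherwise $\beta_{i,j}(b_{i,j}) = 0$.

• Step 4) (The same as Step 4 in the EMSA)

• Step 5) Message normalization and reordering:

\[ a_{i,j} = \arg\min_{d \in \mathrm{GF}(q)} \hat{\alpha}_{i,j}(d), \tag{21} \]

\[ \alpha_{i,j}(d) = \hat{\alpha}_{i,j}(d) - \hat{\alpha}_{i,j}(a_{i,j}), \tag{22} \]

\[ \tilde{\alpha}_{i,j}(\delta) = \alpha_{i,j}(\delta \oplus a_{i,j}). \tag{23} \]

• Step 6) (The same as Step 6 in the EMSA)
• Go to Step 1.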
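The CN processing of Algorithm 2 (Steps 3.1 to 3.4) can be sketched for a single check node as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names `xor_min_sum` and `smsa_cn` are hypothetical, Step 3.3 is evaluated by repeated min-sum combination over all symbol pairs instead of the pruned combination set of Theorem 4 (which yields the same minimum because deviation-space entries are nonnegative and zero at $\delta = 0$), and the messages are toy values over GF(4).

```python
def xor_min_sum(u, v):
    """Combine two deviation-space vectors: out[d] = min over d1 of u[d1] + v[d XOR d1]."""
    q = len(u)
    return [min(u[d1] + v[d ^ d1] for d1 in range(q)) for d in range(q)]


def smsa_cn(alpha_tilde, hard, k=2, c=1.0):
    """SMSA check node processing for all d_c edges of one CN.

    alpha_tilde[j]: deviation-space V2C message of edge j (entry 0 is 0).
    hard[j]:        V2C hard message of edge j.
    Returns (b, beta): C2V hard messages and absolute-space C2V soft messages.
    """
    d_c, q = len(hard), len(alpha_tilde[0])
    total = 0
    for a in hard:
        total ^= a
    b_out, beta_out = [], []
    for j in range(d_c):
        b = total ^ hard[j]                                   # Step 3.1: XOR of the other hard messages
        others = [alpha_tilde[t] for t in range(d_c) if t != j]
        beta1 = [min(m[delta] for m in others) for delta in range(q)]   # Step 3.2
        beta_k = beta1
        for _ in range(k - 1):                                # Step 3.3: up to k symbols combined
            beta_k = xor_min_sum(beta_k, beta1)
        beta_tilde = [c * x for x in beta_k]                  # Step 3.4: scaling
        beta_out.append([beta_tilde[b ^ d] for d in range(q)])  # Step 3.4: reordering
        b_out.append(b)
    return b_out, beta_out


alpha_tilde = [[0, 1, 4, 2], [0, 3, 5, 2], [0, 2, 1, 6]]   # toy messages over GF(4)
hard = [0, 1, 3]
b_out, beta_out = smsa_cn(alpha_tilde, hard, k=2)
```

For the third edge the result is `[1, 0, 2, 3]` with hard message 1, which for this toy case coincides with an exhaustive search over all configurations of the other two edges.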



As a result, the soft message generation is conducted in two steps (Steps 3.2 and 3.3). To
compute C2V messages $\beta_{i,j}$, first in Step 3.2 we compute the minimal entry values $\min_{j'} \tilde{\alpha}_{i,j'}(\delta)$
over all $j' \in N_i \setminus j$ for each $\delta \in \mathrm{GF}(q) \setminus 0$. Then in Step 3.3, the minimal values are used to
generate the approximation of $\tilde{\beta}_{i,j}(\delta)$. Instead of the configurations of all $d_c$ VNs in $N_i$, (20)
optimizes over the combinations of $k$ symbols chosen from the field. Comparing Theorem 3 to
(19) and (20), we can find that by our approximation method, in the SMSA, the optimization is
performed over the VN set and symbol combination set separately and thus has the advantage
of a much smaller search space.

B. Practical Realization 

Because of the complexity issue, the authors of [11] suggested to use $k = 4$ for (1), as using
$k > 4$ is reported to give unnoticeable performance improvement. Correspondingly, we only
consider a small $k$ for (20). But it is still costly to generate all combinations with a large finite
field. For example, with a 64-ary code there are in total $\binom{64}{2} = 2016$ combinations for $k = 2$ and
$\binom{64}{4} = 635376$ for $k = 4$. Even with Theorem 4 applied, the number of required combinations
can be proved to be of $O(q^k)$. For this reason, we consider a reduced-complexity realization
other than directly transforming the algorithm into the implementation. It can be shown that for
$\delta'_1 \oplus \delta'_2 = \delta$ with $\delta'_1 = \bigoplus_{\ell=1}^{h} \delta_\ell$ and $\delta'_2 = \bigoplus_{\ell=h+1}^{k} \delta_\ell$ and $1 \le h < k$, in SMSA $\tilde{\beta}^{(k)}(\delta)$ can also
be approximated by

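\[
\min_{\delta'_1 \oplus \delta'_2 = \delta} \Big( \min_{\bigoplus_{\ell=1}^{h} \delta_\ell = \delta'_1} \; \min_{j_1, \ldots, j_h} \sum_{\ell=1}^{h} \tilde{\alpha}_{i,j_\ell}(\delta_\ell) + \min_{\bigoplus_{\ell=h+1}^{k} \delta_\ell = \delta'_2} \; \min_{j_{h+1}, \ldots, j_k} \sum_{\ell=h+1}^{k} \tilde{\alpha}_{i,j_\ell}(\delta_\ell) \Big)
\ge \min_{\delta'_1 \oplus \delta'_2 = \delta} \Big( \tilde{\beta}'(\delta'_1) + \tilde{\beta}'(\delta'_2) \Big), \tag{24}
\]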

where $\tilde{\beta}'(\delta)$ denotes the primitive message, that is the soft message of any order lower than the
required order $k$. Hence we can successively combine two 2-symbol combinations to make a
4-symbol one by two sub-steps with a look-up table (LUT), in which all 2-symbol combinations

TABLE I

The look-up table D for GF($2^3$).

δ \ f     0       1       2       3
  1     (0,1)   (2,3)   (4,5)   (6,7)
  2     (0,2)   (1,3)   (4,6)   (5,7)
  3     (0,3)   (1,2)   (4,7)   (5,6)
  4     (0,4)   (1,5)   (2,6)   (3,7)
  5     (0,5)   (1,4)   (2,7)   (3,6)
  6     (0,6)   (1,7)   (2,4)   (3,5)
  7     (0,7)   (1,6)   (2,5)   (3,4)


Algorithm 3 Generate the look-up table for GF($q$).

for δ' = 0 … q − 1 do
    for δ'' = (δ' + 1) … q − 1 do
        δ = δ' ⊕ δ'';
        D(δ).Add(δ', δ'');
    end
end



are listed. This method allows us to obtain $k$-symbol combinations using $\log_2 k$ sub-steps, with
$k$ equal to a power of 2. Based on this general technique, in the following we will select $k$ to
meet requirements for complexity and performance, and then practical realizations are provided
specifically for different $k$.

The approximation loss with a small $k$ results from the reduced search, with the search space
size of $O(q^k)$. According to Lemma 5, the full-size search space is of $p$-symbol combinations,
with the size of $O(q^p)$. As the size ratio between the two spaces is of $O(q^{p-k})$, the performance
degradation is supposed to be smaller for smaller fields. $k = 1$ was shown to have huge
performance loss for NB-LDPC codes [11]. By the simulation results in Section IV, setting
$k = 2$ will be shown to have smaller loss with smaller fields when compared to the EMSA, and
$k = 4$ will be shown to provide negligible loss, with field size $q$ up to 256. Since we
observed that using $k > 4$ gives little advantage, two settings $k = 2$ and $k = 4$ will be further
investigated in the following as two tradeoffs between complexity and performance.

Let us first look at the required LUT. As shown in Algorithm 3, the pseudo code generates the
list of combinations $(\delta_1, \delta_2)$ without repetition for each target $\delta$ with $\delta_1 \oplus \delta_2 = \delta$. Since we have
$q/2$ combinations for each of the $q - 1$ targets, $D$ can be depicted as a two-dimensional table with
$q - 1$ rows and $q/2$ columns. For $1 \le \delta \le q - 1$ and $0 \le f \le q/2 - 1$, each cell $D_{\delta,f}$ in
the table is a two-tuple containing two elements $D_{\delta,f}(0)$ and $D_{\delta,f}(1)$, which satisfy the addition
rule $D_{\delta,f}(0) \oplus D_{\delta,f}(1) = \delta$. For example, when $q = 8$, the LUT is provided in Table I.
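The table construction can be sketched in Python as follows. The function name `build_lut` is hypothetical, and the outer loop is assumed to start at 0 so that the pairs $(0, \delta)$ appearing in the first column of Table I are generated.

```python
def build_lut(q):
    """Return D with D[delta] = list of pairs (d1, d2), d1 < d2, such that d1 XOR d2 == delta."""
    D = {delta: [] for delta in range(1, q)}
    for d1 in range(q):
        for d2 in range(d1 + 1, q):   # d2 > d1 avoids listing a pair twice
            D[d1 ^ d2].append((d1, d2))
    return D


D = build_lut(8)
row_sizes = {len(pairs) for pairs in D.values()}   # every row should hold q/2 pairs
```

For $q = 8$ the rows reproduce Table I, for example `D[6]` is `[(0, 6), (1, 7), (2, 4), (3, 5)]`.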



Step 3.3 and (20) can be realized by Steps 3.3.1 and 3.3.2 given below.

• Step 3.3.1) With the LUT $D$, compute the step-1 messages by

\[ \tilde{\beta}^{(2)}_{i,j}(\delta) = \min_{f = 0 \ldots q/2 - 1} \Big( \tilde{\beta}^{(1)}_{i,j}(D_{\delta,f}(0)) + \tilde{\beta}^{(1)}_{i,j}(D_{\delta,f}(1)) \Big). \tag{25} \]

• Step 3.3.2) Compute the step-2 messages by

\[ \tilde{\beta}^{(4)}_{i,j}(\delta) = \min_{f = 0 \ldots q/4 - 1} \Big( \tilde{\beta}^{(2)}_{i,j}(D_{\delta,f}(0)) + \tilde{\beta}^{(2)}_{i,j}(D_{\delta,f}(1)) \Big). \tag{26} \]

By the definition, we let $\tilde{\beta}^{(1)}_{i,j}(0) = \tilde{\beta}^{(2)}_{i,j}(0) = 0$, so $\tilde{\beta}^{(1)}_{i,j}(D_{\delta,0}(0)) + \tilde{\beta}^{(1)}_{i,j}(D_{\delta,0}(1)) = \tilde{\beta}^{(1)}_{i,j}(\delta)$.

The first sub-step combines two symbols $D_{\delta,f}(0)$ and $D_{\delta,f}(1)$ for each $\delta$ and $f$, making a
2-symbol combination. The comparison will be conducted over $f = 0 \ldots q/2 - 1$ for each $\delta$. Assume
that the index of the minimal value is $f^*(\delta)$. Then the second sub-step essentially combines
two two-tuples $D_{D_{\delta,f}(0),\, f^*(D_{\delta,f}(0))}$ and $D_{D_{\delta,f}(1),\, f^*(D_{\delta,f}(1))}$, making a 4-symbol combination. It can
be proved that all 4-symbol combinations can be considered by combining two-tuples $D_{\delta,f}$ of
$f = 0, 1, \ldots, q/4 - 1$. So the second sub-step only processes the left half of the table $D$. For
instance, over GF($2^3$) the left half of $D$ is formed by $f = 0, 1$ in Table I.
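The two sub-steps can be sketched as follows. This is an illustrative reading of Steps 3.3.1 and 3.3.2, assuming the LUT rows are ordered as in Table I so that the left half corresponds to the first $q/4$ pairs of each row; the names `build_lut` and `smsa_step_3_3` and the input values are hypothetical.

```python
def build_lut(q):
    """D[delta] = pairs (d1, d2), d1 < d2, with d1 XOR d2 == delta, ordered as in Table I."""
    D = {delta: [] for delta in range(1, q)}
    for d1 in range(q):
        for d2 in range(d1 + 1, q):
            D[d1 ^ d2].append((d1, d2))
    return D


def smsa_step_3_3(beta1, two_step=True):
    """Combine order-1 deviation-space messages into 2-symbol (and optionally 4-symbol) messages.

    Requires q >= 4 so that the left half of the table is non-empty.
    """
    q = len(beta1)
    D = build_lut(q)
    b1 = list(beta1)
    b1[0] = 0                                     # entry at delta = 0 is zero by definition
    b2 = [0] * q
    for delta in range(1, q):                     # first sub-step: all q/2 pairs per row
        b2[delta] = min(b1[x] + b1[y] for x, y in D[delta])
    if not two_step:
        return b2                                 # one-step variant stops here
    b4 = [0] * q
    for delta in range(1, q):                     # second sub-step: left half of the table only
        b4[delta] = min(b2[x] + b2[y] for x, y in D[delta][: q // 4])
    return b4


beta1 = [0, 5, 3, 9, 2, 8, 7, 6]                  # toy order-1 message over GF(8)
one_step = smsa_step_3_3(beta1, two_step=False)
two_step = smsa_step_3_3(beta1, two_step=True)
```

Because the first pair of every row is $(0, \delta)$, each sub-step can only keep or lower an entry, so the two-step result is entry-wise no larger than the one-step result, which in turn is no larger than the input.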

For k = 2 and k = 4 respectively, we define two versions of SMSA, i.e., the one-step SMSA 
(denoted as SMSA-1) and the two-step SMSA (denoted as SMSA-2). The SMSA-1 is the same 
as the SMSA-2 except for the implementation of Step 3.3. The SMSA-1 only requires Step 3.3.1 
and skips Step 3.3.2, while the SMSA-2 implements both steps. We will present the performance 
and complexity results of the SMSA-1 and SMSA-2 in the following sections. 

IV. Simulation Results

In this section, we use five examples to demonstrate the performance of the proposed
SMSA for decoding NB-LDPC codes. The existing algorithms including the QSPA, EMSA, and
MMA are used for performance comparison. The SMSA includes the one-step (SMSA-1) and
two-step (SMSA-2) versions. In the first two examples, three codes over GF($2^4$), GF($2^6$), and
GF($2^8$) are considered. We show that the SMSA-2 has very good performance for different finite
fields and modulations, and that the SMSA-1 has small performance loss compared to the SMSA-2
over GF($2^4$) and GF($2^5$). The binary phase-shift keying (BPSK) and quadrature amplitude
modulation (QAM) are applied over the additive white Gaussian noise (AWGN) channel. In the
third example, we study the fixed-point realizations of the SMSA and find that it is exceptionally
suitable for hardware implementation. The fourth example compares the performance of the
SMSA, QSPA, EMSA, and MMA over the uncorrelated Rayleigh-fading channel. The SMSA-2
shows its reliability with higher channel randomness. In the last example, we study the
convergence speed of the SMSA and show that it converges almost as fast as the EMSA.

Example 1. (BPSK-AWGN) Three codes constructed by computer search over different finite
fields are used in this example. Four iterative decoding algorithms (SMSA, QSPA, EMSA, and
MMA) are simulated with the BPSK modulation over the binary-input AWGN channel for every
code. The maximal iteration number $\kappa_{\max}$ is set to 50 for all algorithms. The bit error rate (BER)
and block error rate (BLER) are obtained to characterize the error performance. The first code is
a rate-0.769 (3,13)-regular (1057,813) code over GF($2^4$), and its error performance is shown in
Fig. 1. We use optimal scaling factors $c$ = 0.60, 0.75, and 0.73 for the SMSA-1, SMSA-2, and
EMSA respectively. The second code is a rate-0.875 (3,24)-regular (495,433) code over GF($2^6$),
and its error performance is shown in Fig. 2. We use optimal scaling factors $c$ = 0.50, 0.70, and
0.65 for the SMSA-1, SMSA-2, and EMSA respectively. The third code is a rate-0.70 (3,10)-regular
(273,191) code over GF($2^8$), and its error performance is shown in Fig. 3. We use optimal
scaling factors $c$ = 0.35, 0.575, and 0.60 for the SMSA-1, SMSA-2, and EMSA respectively.
Taking the EMSA as a benchmark at BLER of $10^{-5}$, we observe that the SMSA-2 has SNR loss
of less than 0.05 dB, while the MMA suffers from about 0.1 dB loss. The SMSA-1 has 0.06
dB loss with GF($2^4$) and almost 0.15 dB loss with GF($2^6$) and GF($2^8$) against the EMSA. As
discussed in Section III-B, the SMSA-1 performs better with smaller fields. Finally, the QSPA
has SNR gain of less than 0.05 dB and yet is viewed as undesirable for implementation.

Example 2. (QAM-AWGN) Fig. 4 shows the performance of the 64-ary (495,433) code, the
second code in Example 1, with the rectangular 64-QAM. Four decoding algorithms (SMSA,
QSPA, EMSA, and MMA) are simulated with finite field symbols directly mapped to the
Gray-coded constellation symbols over the AWGN channel. The maximal iteration number $\kappa_{\max}$ is set
to 50 for all algorithms. The SMSA-1, SMSA-2, and EMSA have the optimal scaling factors
$c$ = 0.37, 0.60, and 0.50 respectively. We note that the SMSA-2 and EMSA achieve nearly the
same BER and BLER, while the MMA and SMSA-1 have 0.11 and 0.14 dB of performance
loss.

Example 3. (Fixed-Point Analysis) To investigate the effectiveness of the SMSA, we evaluate
the block error performance of the (620,310) code over GF($2^5$) taken from [9]. The parity-check
matrix of the code is a 10 x 20 array of 31 x 31 circulant permutation matrices and zero matrices.
The floating-point QSPA, EMSA, MMA, SMSA-1, and SMSA-2 and the fixed-point SMSA-1
and SMSA-2 are simulated using the BPSK modulation over the AWGN channel. The BLER
results are shown in Fig. 5. The optimal scaling factors for the SMSA-1, SMSA-2, and EMSA
are $c$ = 0.6875, 0.6875, and 0.65 respectively. The maximal iteration number $\kappa_{\max}$ is set to 50
for all algorithms. Let $I$ and $F$ denote the number of bits for the integer part and fraction part
of the quantization scheme. We observe that for SMSA-1 and SMSA-2 five bits ($I = 3$, $F = 2$)
are sufficient. For approximating the QSPA and EMSA, the SMSA-2 has SNR loss of only 0.1
dB and 0.04 dB at BLER of $10^{-4}$, respectively. And the SMSA-1 has SNR loss of 0.14 dB and
0.08 dB respectively.

Example 4. (Fading Channel) To test the reliability of the SMSA, we examine the error
performance of the 32-ary (620,310) code given in Example 3 over the uncorrelated Rayleigh-fading
channel with additive Gaussian noise. The channel information is assumed to be known
to the receiver. The floating-point QSPA, EMSA, MMA, SMSA-1, and SMSA-2 are simulated
using the BPSK modulation, and the BLER results are shown in Fig. 6. Compared to the EMSA,
the SNR loss of SMSA-2 is within 0.1 dB, while the SMSA-1 and MMA have around 0.2 dB
loss. The QSPA has performance gain in low and medium SNR regions and no gain at high
SNR.

Example 5. (Convergence Speed) Consider again the 32-ary (620,310) code given in Example
3. The block error performances for this code using the SMSA-2 and EMSA with 4, 5, 7,
and 10 maximal iterations are shown in Fig. 7. At BLER of $10^{-3}$, the SNR gap between the
SMSA-2 and EMSA is 0.04 dB for various $\kappa_{\max}$. To further investigate the convergence speed,
we summarize the average number of iterations for the EMSA and SMSA-2 with 20, 50, and
100 maximal iterations and show the results in Fig. 8. It should be noted that, as shown in Fig. 5,
the SNR gap of BLER between EMSA and SMSA-2 is about 0.04 dB, at BLER of $10^{-3}$ and
SNR of about 2.2 dB. By examining the curves of Fig. 8 at SNR of about 2.2 dB (in the partial
enlargement), we observe that for the same average iteration number the difference of required
SNR is also around 0.04 dB between the two algorithms. Since a decoding failure increases the
average iteration number, the SNR gap of error performance can be seen as the main reason for
the SNR gap of average iteration numbers. Therefore, as failures occur often at low SNR
and rarely at high SNR, in Fig. 8 the iteration increase for SMSA-2 at high SNR is negligible
(< 5% at 2.2 dB), and at low and medium SNR the gap is larger (≈ 11% at 1.8 dB). Although
the result is not shown, we observe that the SMSA-1 also has similar convergence properties,
and the iteration increase compared with the EMSA at high SNR is around 6%.

V. Complexity Analysis



In this section, we analyze the computational complexity of the SMSA and compare it with
the EMSA. The comparison of average required iterations is provided in Example 5 of Section
IV. With a fixed SNR, the SMSA requires slightly more (5% to 6%) iterations on average
than the EMSA in the medium and high SNR regions. As the two algorithms have small (within
0.2 dB) performance difference, especially between the EMSA and SMSA-2, we think that it
is fair to simply compare the complexity of the SMSA and EMSA by the computations per
iteration. Moreover, since the VN processing is similar for both algorithms, we only analyze the
CN processing. The required operation counts per iteration for a CN with degree $d_c$ are adopted
as the metric.

To further reduce the duplication of computations in CN processing, we propose to transform
Steps 3.1 and 3.2 of the SMSA as follows. Step 3.1 can be transformed into two sub-steps. We
define

\[ A_i = \bigoplus_{j' \in N_i} a_{i,j'}. \]

Then each $b_{i,j}$ can be computed by

\[ b_{i,j} = A_i \oplus a_{i,j}. \]

Thus in total it takes $2d_c - 1$ finite field additions to compute this step for a CN.

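Similarly, the computation of Step 3.2 can be transformed into two sub-steps. For the $i$-th row
of the parity-check matrix, we define a three-tuple $\{\mathrm{min1}_i(\delta), \mathrm{min2}_i(\delta), \mathrm{idx}_i(\delta)\}$, in which

\[ \mathrm{min1}_i(\delta) = \min_{j' \in N_i} \tilde{\alpha}_{i,j'}(\delta), \]

\[ \tilde{\alpha}_{i,\mathrm{idx}_i(\delta)}(\delta) = \mathrm{min1}_i(\delta), \]

\[ \mathrm{min2}_i(\delta) = \min_{j' \in N_i \setminus \mathrm{idx}_i(\delta)} \tilde{\alpha}_{i,j'}(\delta). \]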
TABLE II 

The required operations per iteration and memory usage to perform the CN processing of a CN with 
degree d_c for a q-ary code. The bit width per sub-message is w. 

Type                        SMSA-1                             SMSA-2                             EMSA
Finite Field Additions      2d_c - 1                           2d_c - 1
Summations                  (q/2 - 1)(q - 1)d_c                (3q/4 - 2)(q - 1)d_c               3(d_c - 2)q^2
Comparisons and Selections  ((q/2 + 2)d_c - 3)(q - 1)          ((3q/4 + 1)d_c - 3)(q - 1)         3(d_c - 2)q(q - 1)
Memory Usage (Bits)         (2w + ⌈log2 d_c⌉)(q - 1) + p d_c   (2w + ⌈log2 d_c⌉)(q - 1) + p d_c   w d_c q

For each nonzero symbol $\delta$ in GF(q), it takes at most $1 + 2(d_c - 2) = 2d_c - 3$ min operations, 
each of which can be realized by a comparator and a multiplexor, to compute the 3-tuple 
$\{\mathrm{min1}_i(\delta), \mathrm{min2}_i(\delta), \mathrm{idx}_i(\delta)\}$. 
The remaining computations of Step 3.2 can be carried out equivalently by 

$$\tilde{\beta}_{i,j}(\delta) = \mathrm{min1}_i(\delta) \quad \text{if } j \neq \mathrm{idx}_i(\delta),$$

$$\tilde{\beta}_{i,j}(\delta) = \mathrm{min2}_i(\delta) \quad \text{if } j = \mathrm{idx}_i(\delta).$$

It takes $d_c$ comparisons and $d_c$ two-to-one selections to perform these operations. As there 
are $q - 1$ entries of $\tilde{\beta}_{i,j}$, the overall computation of Step 3.2 per CN requires 
$(3d_c - 3)(q - 1)$ comparators and $(3d_c - 3)(q - 1)$ multiplexors. 
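
The two sub-steps of Step 3.2 can be illustrated in software as follows. This is a sketch under stated assumptions rather than a description of the hardware: the function names are hypothetical, the soft messages for one fixed nonzero symbol are given as a plain list indexed by edge, and the sequential scan does not model the comparator and multiplexor structure counted above.

```python
def min_tuple(alpha):
    """Return (min1, min2, idx) over the soft messages of one symbol.

    alpha[j] is the incoming soft message on the j-th edge of the check
    node for a fixed nonzero symbol; at least two entries are assumed.
    """
    min1, min2, idx = float("inf"), float("inf"), -1
    for j, value in enumerate(alpha):
        if value < min1:
            # New smallest value: the old minimum becomes the runner-up.
            min1, min2, idx = value, min1, j
        elif value < min2:
            min2 = value
    return min1, min2, idx


def recover_outgoing(min1, min2, idx, degree):
    """Rebuild all outgoing messages of Step 3.2 from the 3-tuple."""
    # Every edge sees min1, except the edge that supplied it, which sees min2.
    return [min2 if j == idx else min1 for j in range(degree)]
```

The recovered list matches what a direct evaluation of the minimum over all other edges would give for each edge, which is why storing the 3-tuple is sufficient.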

For the SMSA-2, Step 3.3 is realized by Steps 3.3.1 and 3.3.2. To compute Step 3.3.1 for each 
symbol $\delta$, it takes $(q - 2)/2$ summations and $(q - 2)/2$ comparisons. To compute Step 3.3.2 for 
each $\delta$, it takes $(q - 4)/4$ summations and $(q - 4)/4$ comparisons. Therefore, in total it takes 
$3q/4 - 2$ summations and min operations for each $\delta$. As there are $q - 1$ nonzero symbols in GF(q), 
overall it requires $(3q/4 - 2)(q - 1)d_c$ summations, comparisons, and two-to-one selections. For 
the SMSA-1, it requires $(q/2 - 1)(q - 1)d_c$ summations, comparisons, and two-to-one selections. 
Step 3.4 performs scaling and shifting and is therefore ignored here, since its workload is negligible 
compared to the LLR calculations. 
Next, let us analyze the CN processing in the EMSA for comparison. As in [14], [15], [17], 
the forward-backward scheme is usually used to reduce the implementation complexity. For a CN with 
degree $d_c$, $3(d_c - 2)$ stages are required, and each stage needs $q^2$ summations and $(q - 1)q$ 
min operations. Overall, in the EMSA, each CN requires $3(d_c - 2)q^2$ summations, 
$3(d_c - 2)(q - 1)q$ comparisons, and $3(d_c - 2)(q - 1)q$ two-to-one selections. The results for 
the SMSA and EMSA are summarized in Table II. As the finite field additions required by the SMSA 
take only marginal area in implementation, we see that the SMSA requires far fewer computations 
than the EMSA. 
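
The per-CN operation counts can be evaluated directly from the expressions of Table II. The helper below is only a convenience sketch; the function name is hypothetical, and it assumes that q is a power of two with q ≥ 4 so that the integer divisions are exact.

```python
def cn_operation_counts(q, dc):
    """Per-iteration operation counts for one check node of degree dc over GF(q).

    Returns a dict mapping decoder name to (summations, comparisons).
    The formulas are those of Table II; q is assumed to be a power of two
    with q >= 4 so that q/2 and 3q/4 are integers.
    """
    step32 = (3 * dc - 3) * (q - 1)          # comparisons in Step 3.2
    smsa1 = (q // 2 - 1) * (q - 1) * dc      # Step 3.3, one-step version
    smsa2 = (3 * q // 4 - 2) * (q - 1) * dc  # Steps 3.3.1 and 3.3.2
    stages = 3 * (dc - 2)                    # forward-backward stages of the EMSA
    return {
        "SMSA-1": (smsa1, smsa1 + step32),
        "SMSA-2": (smsa2, smsa2 + step32),
        "EMSA": (stages * q * q, stages * q * (q - 1)),
    }
```

Calling `cn_operation_counts(32, 6)` gives the counts for a degree-6 check node over GF($2^5$), the setting of the (620,310) code used in Section IV.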

Since the computational complexity of decoding NB-LDPC codes is very large, decoder 
implementations usually adopt partially-parallel architectures. Therefore, the CN-to-VN messages 
are usually stored in the decoder memory for future VN processing. As memory occupies a 
significant amount of silicon area in hardware implementations, optimizing the memory usage 
becomes an important research problem [14], [15], [17]. For Step 3.2 of the SMSA, the 3-tuple 
$\{\mathrm{min1}_i(\delta), \mathrm{min2}_i(\delta), \mathrm{idx}_i(\delta)\}$ can be used to recover 
the messages $\tilde{\beta}_{i,j}(\delta)$ for all $j \in N_i$. Assume 
that the bit width of each entry of the soft message is w in the CN processing. Then for each nonzero 
$\delta$ in GF(q), the SMSA needs to store $2w + \lceil \log_2 d_c \rceil$ bits for the 3-tuple. It also needs to 
store the hard messages $\hat{a}_{i,j}$ in Step 3.1, which translate to $p \times d_c$ bits of storage. To store the 
intermediate messages for the CN processing of each row, the SMSA therefore requires a total of 
$(2w + \lceil \log_2 d_c \rceil)(q - 1) + p d_c$ bits. In comparison, for the EMSA, there is no correlation between 
the $\beta_{i,j}(d)$ of each $j \in N_i$ in the i-th CN. Therefore, the EMSA must store the soft messages 
$\beta_{i,j}(d)$ of all $j \in N_i$, which translate to $w \times d_c \times q$ bits. We see that the SMSA requires much 
less memory than the EMSA. 
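
The two storage expressions can be compared with a short helper. This is a sketch with a hypothetical function name; it assumes $q = 2^p$, so that each hard message occupies p bits, and it uses integer bit lengths to evaluate the ceiling of the base-2 logarithm without floating-point arithmetic.

```python
def cn_memory_bits(q, dc, w):
    """Bits stored per check node, returned as (SMSA, EMSA).

    q is assumed to be 2**p, so each hard message occupies p bits;
    w is the bit width of one soft sub-message.
    """
    p = q.bit_length() - 1            # p = log2(q) for q a power of two
    idx_bits = (dc - 1).bit_length()  # ceil(log2(dc)) for dc >= 1
    # One 3-tuple per nonzero symbol: two w-bit minima plus an edge index.
    smsa = (2 * w + idx_bits) * (q - 1) + p * dc
    # The EMSA keeps a w-bit sub-message for every symbol on every edge.
    emsa = w * dc * q
    return smsa, emsa
```

Because the 3-tuple size grows only logarithmically with $d_c$, the SMSA storage is dominated by the $(q - 1)$ factor, whereas the EMSA storage grows linearly in both $d_c$ and q.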



We take as an example the (620,310) code over GF($2^5$) used in Section IV. With w = 5 and 
$d_c$ = 6, the SMSA-1 requires 2790 summations, 3255 comparisons, and 433 memory bits for 
each CN per iteration, and the SMSA-2 requires 4092 summations, 4557 comparisons, and 433 
memory bits. The EMSA requires 12288 summations, 11904 comparisons, and 960 memory 
bits. As a result, compared to the EMSA, the SMSA-1 saves 77% on summations and 73% on 
comparisons, and the SMSA-2 saves 67% and 62%, respectively. Both SMSA versions 
save 55% on memory bits. More hardware implementation results are presented for the SMSA-2 in 
[19], which shows exceptional savings in silicon area compared with existing NB-LDPC 
decoders. 
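
As a worked check of the example above, the SMSA-1 and EMSA figures follow from the Table II expressions by direct substitution of q = 32, $d_c$ = 6, w = 5, and p = 5. Nothing here is a new result; the lines only restate the arithmetic.

```latex
\begin{align*}
\text{SMSA-1 summations:}\quad & (q/2 - 1)(q - 1)d_c = 15 \cdot 31 \cdot 6 = 2790,\\
\text{SMSA-1 comparisons:}\quad & \bigl((q/2 + 2)d_c - 3\bigr)(q - 1) = 105 \cdot 31 = 3255,\\
\text{SMSA memory bits:}\quad & (2w + \lceil \log_2 d_c \rceil)(q - 1) + p\,d_c = 13 \cdot 31 + 30 = 433,\\
\text{EMSA summations:}\quad & 3(d_c - 2)q^2 = 12 \cdot 1024 = 12288,\\
\text{EMSA memory bits:}\quad & w\,d_c\,q = 5 \cdot 6 \cdot 32 = 960,\\
\text{Summation saving:}\quad & 1 - 2790/12288 \approx 77\%,\\
\text{Memory saving:}\quad & 1 - 433/960 \approx 55\%.
\end{align*}
```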



VI. Conclusions 

In this paper, we have presented a hardware-efficient decoding algorithm, called the SMSA, 
to decode NB-LDPC codes. This algorithm is devised based on significantly reducing the search 
space of combinatorial optimization in the CN processing. Two practical realizations, the one-step 
and two-step SMSAs, are proposed for effective complexity-performance tradeoffs. Simulation 
results show that with field size up to 256, the two-step SMSA has negligible error performance 
loss compared to the EMSA over the AWGN and Rayleigh-fading channels. The one-step SMSA 
has 0.1 to 0.2 dB loss depending on the field size. Also, the fixed-point study and convergence 
speed research show that it is suitable for hardware implementation. Another important feature of 
SMSA is simplicity. Based on our analysis, the SMSA has much lower computational complexity 
and memory usage compared to other decoding algorithms for NB-LDPC codes. We believe that 
our work for the hardware-efficient algorithm will encourage researchers to explore the use of 
NB-LDPC codes in emerging applications. 
Fig. 1. BLER and BER comparison of the SMSA-1, SMSA-2, EMSA, MMA, and QSPA with the (1057,813) code over 
GF($2^4$). The BPSK is used over the AWGN channel. The maximal iteration number $\kappa_{\max}$ is set to 50. 

[Plot: BLER and BER versus SNR (dB), 3.6 to 4.3 dB; curves for the SMSA-1, SMSA-2, EMSA, MMA, and QSPA.] 
Fig. 2. BLER and BER comparison of the SMSA-1, SMSA-2, EMSA, MMA, and QSPA with the (495,433) code over GF($2^6$). 
The BPSK is used over the AWGN channel. The maximal iteration number $\kappa_{\max}$ is set to 50. 

Fig. 3. BLER and BER comparison of the SMSA-1, SMSA-2, EMSA, MMA, and QSPA with the (273,191) code over GF($2^8$). 
The BPSK is used over the AWGN channel. The maximal iteration number $\kappa_{\max}$ is set to 50. 

Fig. 4. BLER and BER comparison of the SMSA-1, SMSA-2, EMSA, MMA, and QSPA with the (495,433) code over GF($2^6$). 
The 64-QAM is used over the AWGN channel. The maximal iteration number $\kappa_{\max}$ is set to 50. 

Fig. 5. BLER comparison of the SMSA-1, SMSA-2 (fixed-point and floating-point), QSPA, EMSA, and MMA (floating-point 
only) with the (620,310) code over GF($2^5$). The BPSK is used over the AWGN channel. The maximal iteration number 
$\kappa_{\max}$ is set to 50. 

Fig. 6. BLER comparison of the SMSA-1, SMSA-2, QSPA, EMSA, and MMA with the (620,310) code over GF($2^5$). The 
BPSK is used over the uncorrelated Rayleigh-fading channel. The maximal iteration number $\kappa_{\max}$ is set to 50. 

Fig. 7. BLER comparison of the SMSA-2 and EMSA with the (620,310) code over GF($2^5$). The BPSK is used over the AWGN 
channel. The maximal iteration number $\kappa_{\max}$ is set to 4, 5, 7, and 10. 

Fig. 8. The average number of iterations for the SMSA-2 and EMSA with the (620,310) code over GF($2^5$). The BPSK is used 
over the AWGN channel. The maximal iteration number $\kappa_{\max}$ is set to 20, 50, and 100. 